Perceptual synthesis of auditory scenes

ABSTRACT

An auditory scene is synthesized by applying two or more different sets of one or more spatial parameters (e.g., an interaural level difference (ILD), interaural time difference (ITD), and/or head-related transfer function (HRTF)) to two or more different frequency bands of a combined audio signal, where each different frequency band is treated as if it corresponded to a single audio source in the auditory scene. In one embodiment, the combined audio signal corresponds to the combination of two or more different source signals, where each different frequency band corresponds to a region of the combined audio signal in which one of the source signals dominates the others. In this embodiment, the different sets of spatial parameters are applied to synthesize an auditory scene comprising the different source signals.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of co-pending application Ser. No. 09/848,877, filed on May 4, 2001 as attorney docket no. Faller 5, the teachings of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the synthesis of auditory scenes, that is, the generation of audio signals to produce the perception that the audio signals are generated by one or more different audio sources located at different positions relative to the listener.

2. Description of the Related Art

When a person hears an audio signal (i.e., sounds) generated by a particular audio source, the audio signal will typically arrive at the person's left and right ears at two different times and with two different audio (e.g., decibel) levels, where those different times and levels are functions of the differences in the paths through which the audio signal travels to reach the left and right ears, respectively. The person's brain interprets these differences in time and level to give the person the perception that the received audio signal is being generated by an audio source located at a particular position (e.g., direction and distance) relative to the person. An auditory scene is the net effect of a person simultaneously hearing audio signals generated by one or more different audio sources located at one or more different positions relative to the person.

The existence of this processing by the brain can be used to synthesize auditory scenes, where audio signals from one or more different audio sources are purposefully modified to generate left and right audio signals that give the perception that the different audio sources are located at different positions relative to the listener.

FIG. 1 shows a high-level block diagram of conventional binaural signal synthesizer 100, which converts a single audio source signal (e.g., a mono signal) into the left and right audio signals of a binaural signal, where a binaural signal is defined to be the two signals received at the eardrums of a listener. In addition to the audio source signal, synthesizer 100 receives a set of spatial parameters corresponding to the desired position of the audio source relative to the listener. In typical implementations, the set of spatial parameters comprises an interaural level difference (ILD) value (which identifies the difference in audio level between the left and right audio signals as received at the left and right ears, respectively) and an interaural time delay (ITD) value (which identifies the difference in time of arrival between the left and right audio signals as received at the left and right ears, respectively). In addition or as an alternative, some synthesis techniques involve the modeling of a direction-dependent transfer function for sound from the signal source to the eardrums, also referred to as the head-related transfer function (HRTF). See, e.g., J. Blauert, The Psychophysics of Human Sound Localization, MIT Press, 1983, the teachings of which are incorporated herein by reference.

Using binaural signal synthesizer 100 of FIG. 1, the mono audio signal generated by a single sound source can be processed such that, when listened to over headphones, the sound source is spatially placed by applying an appropriate set of spatial parameters (e.g., ILD, ITD, and/or HRTF) to generate the audio signal for each ear. See, e.g., D. R. Begault, 3-D Sound for Virtual Reality and Multimedia, Academic Press, Cambridge, Mass., 1994.

Binaural signal synthesizer 100 of FIG. 1 generates the simplest type of auditory scene: one having a single audio source positioned relative to the listener. More complex auditory scenes comprising two or more audio sources located at different positions relative to the listener can be generated using an auditory scene synthesizer that is essentially implemented using multiple instances of binaural signal synthesizer 100, where each binaural signal synthesizer instance generates the binaural signal corresponding to a different audio source. Since each different audio source has a different location relative to the listener, a different set of spatial parameters is used to generate the binaural audio signal for each different audio source.

FIG. 2 shows a high-level block diagram of conventional auditory scene synthesizer 200, which converts a plurality of audio source signals (e.g., a plurality of mono signals) into the left and right audio signals of a single combined binaural signal, using a different set of spatial parameters for each different audio source. The left audio signals are then combined (e.g., by simple addition) to generate the left audio signal for the resulting auditory scene, and similarly for the right.

One of the applications for auditory scene synthesis is in conferencing. Assume, for example, a desktop conference with multiple participants, each of whom is sitting in front of his or her own personal computer (PC) in a different city. In addition to a PC monitor, each participant's PC is equipped with (1) a microphone that generates a mono audio source signal corresponding to that participant's contribution to the audio portion of the conference and (2) a set of headphones for playing that audio portion. Displayed on each participant's PC monitor is the image of a conference table as viewed from the perspective of a person sitting at one end of the table. Displayed at different locations around the table are real-time video images of the other conference participants.

In a conventional mono conferencing system, a server combines the mono signals from all of the participants into a single combined mono signal that is transmitted back to each participant. To give each participant the more realistic perception of sitting around an actual conference table in a room with the other participants, the server can implement an auditory scene synthesizer, such as synthesizer 200 of FIG. 2, that applies an appropriate set of spatial parameters to the mono audio signal from each different participant and then combines the different left and right audio signals to generate the left and right audio signals of a single combined binaural signal for the auditory scene. The left and right audio signals for this combined binaural signal are then transmitted to each participant. One of the problems with such conventional stereo conferencing systems relates to transmission bandwidth, since the server has to transmit both a left audio signal and a right audio signal to each conference participant.

SUMMARY OF THE INVENTION

The present invention is directed to a technique for synthesizing auditory scenes that addresses the transmission bandwidth problem of the prior art. According to the present invention, an auditory scene corresponding to multiple audio sources located at different positions relative to the listener is synthesized from a single combined (e.g., mono) audio signal. As such, in the case of the conference described previously, a solution can be implemented in which each participant's PC receives only a single mono audio signal corresponding to a combination of the mono audio source signals from all of the participants.

The present invention is based on the assumption that, for those frequency bands in which the energy of the source signal from a particular audio source dominates the energies of all other source signals in the combined audio signal, the combined audio signal can, from the perspective of the listener's perception, be treated as if it corresponded solely to that particular audio source. According to implementations of the present invention, different sets of spatial parameters (corresponding to different audio sources) are applied to the different frequency bands of the combined audio signal in which different audio sources dominate, in order to synthesize an auditory scene.

In one embodiment, the present invention is a method for synthesizing an auditory scene, comprising the steps of (a) dividing an input audio signal into a plurality of different frequency bands; and (b) applying two or more different sets of one or more spatial parameters to two or more of the different frequency bands in the input audio signal to generate two or more synthesized audio signals of the auditory scene, wherein for each of the two or more different frequency bands, the corresponding set of one or more spatial parameters is applied to the input audio signal as if the input audio signal corresponded to a single audio source in the auditory scene.

In another embodiment, the present invention is an apparatus for synthesizing an auditory scene, comprising (1) an auditory scene synthesizer configured to (a) divide an input audio signal into a plurality of different frequency bands; and (b) apply two or more different sets of one or more spatial parameters to two or more of the different frequency bands in the input audio signal to generate two or more synthesized audio signals of the auditory scene, wherein for each of the two or more different frequency bands, the corresponding set of one or more spatial parameters is applied to the input audio signal as if the input audio signal corresponded to a single audio source in the auditory scene; and (2) one or more inverse time-frequency transformers configured to convert the two or more synthesized audio signals from a frequency domain into a time domain.

In yet another embodiment, the present invention is a method for processing two or more input audio signals, comprising the steps of (a) converting the two or more input audio signals from a time domain into a frequency domain; (b) generating a set of one or more auditory scene parameters for each of two or more different frequency bands in the two or more converted input audio signals, where each set of one or more auditory scene parameters is generated as if the corresponding frequency band corresponded to a single audio source in an auditory scene; and (c) combining the two or more input audio signals to generate a combined audio signal.

In yet another embodiment, the present invention is an apparatus for processing two or more input audio signals, comprising (a) a time-frequency transformer configured to convert the two or more input audio signals from a time domain into a frequency domain; (b) an auditory scene parameter generator configured to generate a set of one or more auditory scene parameters for each of two or more different frequency bands in the two or more converted input audio signals, where each set of one or more auditory scene parameters is generated as if the corresponding frequency band corresponded to a single audio source in an auditory scene; and (c) a combiner configured to combine the two or more input audio signals to generate a combined audio signal.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects, features, and advantages of the present invention will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings in which:

FIG. 1 shows a high-level block diagram of a conventional binaural signal synthesizer that converts a single audio source signal (e.g., a mono signal) into the left and right audio signals of a binaural signal;

FIG. 2 shows a high-level block diagram of a conventional auditory scene synthesizer that converts a plurality of audio source signals (e.g., a plurality of mono signals) into the left and right audio signals of a single combined binaural signal;

FIG. 3 shows a block diagram of a conferencing system, according to one embodiment of the present invention;

FIG. 4 shows a block diagram of the audio processing implemented by the conference server of FIG. 3, according to one embodiment of the present invention;

FIG. 5 shows a flow diagram of the processing implemented by the auditory scene parameter generator of FIG. 4, according to one embodiment of the present invention;

FIG. 6 shows a graphical representation of the power spectra of the audio signals from three different exemplary sources;

FIG. 7 shows a block diagram of the audio processing performed by each conference node in FIG. 3;

FIG. 8 shows a graphical representation of the power spectrum in the frequency domain for the combined signal generated from the three mono source signals in FIG. 6;

FIG. 9 shows a representation of the analysis window for the time-frequency transform, according to one embodiment of the present invention; and

FIG. 10 shows a block diagram of the transmitter for an alternative application of the present invention, according to one embodiment of the present invention.

DETAILED DESCRIPTION

FIG. 3 shows a block diagram of a conferencing system 300, according to one embodiment of the present invention. Conferencing system 300 comprises conference server 302, which supports conferencing between a plurality of conference participants, where each participant uses a different conference node 304. In preferred embodiments of the present invention, each node 304 is a personal computer (PC) equipped with a microphone 306 and headphones 308, although other hardware configurations are also possible. Since the present invention is directed to processing of the audio portion of conferences, the following description omits reference to the processing of the video portion of such conferences, which involves the generation, manipulation, and display of video signals by video cameras, video signal processors, and digital monitors that would be included in conferencing system 300, but are not explicitly represented in FIG. 3. The present invention can also be implemented for audio-only conferencing.

As indicated in FIG. 3, each node 304 transmits a (e.g., mono) audio source signal generated by its microphone 306 to server 302, where that source signal corresponds to the corresponding participant's contribution to the conference. Server 302 combines the source signals from the different participants into a single (e.g., mono) combined audio signal and transmits that combined signal back to each node 304. (Depending on the type of echo-cancellation performed, if any, the combined signal transmitted to each node 304 may be either unique to that node or the same as the combined signal transmitted to every other node.) In addition to the combined signal, server 302 transmits an appropriate set of auditory scene parameters to each node 304. Each node 304 applies the set of auditory scene parameters to the combined signal, in a manner according to the present invention, to generate a binaural signal corresponding to the auditory scene for the conference, for rendering by headphones 308.

The processing of conference server 302 may be implemented within a distinct node of conferencing system 300. Alternatively, the server processing may be implemented in one of the conference nodes 304, or even distributed among two or more different conference nodes 304.

FIG. 4 shows a block diagram of the audio processing implemented by conference server 302 of FIG. 3, according to one embodiment of the present invention. As shown in FIG. 4, auditory scene parameter generator 402 generates one or more sets of auditory scene parameters from the plurality of source signals generated by and received from the various conference nodes 304 of FIG. 3. In addition, signal combiner 404 combines the plurality of source signals (e.g., using straightforward audio signal addition) to generate the combined signal that is transmitted back to each conference node 304.

FIG. 5 shows a flow diagram of the processing implemented by auditory scene parameter generator 402 of FIG. 4, according to one embodiment of the present invention. Generator 402 applies a time-frequency (TF) transform, such as a discrete Fourier transform (DFT), to convert each node's source signal to the frequency domain (step 502 of FIG. 5). Generator 402 then compares the power spectra of the different converted source signals to identify one or more frequency bands in which the energy of one of the source signals dominates all of the other signals (step 504).

Depending on the implementation, different criteria may be applied to determine whether a particular source signal dominates the other source signals. For example, a particular source signal may be said to dominate all of the other source signals when the energy of that source signal exceeds the sum of the energies of the other source signals by either a specified factor or a specified amount of power (e.g., in dB). Alternatively, a particular source signal may be said to dominate when its energy exceeds that of the second most powerful source signal by a specified factor or a specified amount of power. Other criteria are, of course, also possible, including those that combine two or more different comparisons. For example, in addition to relative dominance, a source signal might also be required to exceed a specified absolute energy level before qualifying as a dominating source signal.
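
The dominance test of step 504 can be illustrated with a short sketch. The following Python fragment is purely illustrative (the function name, thresholds, and the particular criterion are assumptions, not taken from the patent); it marks each spectral bin with the index of a dominating source, using the "exceeds the sum of the other sources by a specified factor" criterion described above:

```python
import numpy as np

def find_dominant_bins(power, factor_db=6.0, floor_db=-60.0):
    """For each spectral bin, return the index of the dominating source,
    or -1 if no source dominates.

    power: (num_sources, num_bins) array of per-source power spectra.
    factor_db: a source dominates a bin if its power exceeds the summed
               power of all other sources by this many dB (one possible
               criterion; others are described in the text).
    floor_db: absolute power (in dB) a bin must exceed to qualify at all.
    """
    power = np.asarray(power, dtype=float)
    total = power.sum(axis=0)
    dominant = np.full(power.shape[1], -1, dtype=int)
    for n in range(power.shape[1]):
        m = int(np.argmax(power[:, n]))          # strongest source in bin n
        rest = total[n] - power[m, n]            # summed power of the others
        loud_enough = 10 * np.log10(power[m, n] + 1e-12) > floor_db
        if loud_enough and power[m, n] > rest * 10 ** (factor_db / 10):
            dominant[n] = m
    return dominant
```

Contiguous runs of bins sharing the same source index then form the dominated frequency bands of step 504; bins marked -1 belong to the frequency ranges in which no source dominates.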

FIG. 6 shows a graphical representation of the power spectra of the audio signals from three different exemplary sources (labeled A, B, and C). FIG. 6 identifies eight different frequency bands in which one of the three source signals dominates the other two. Note that, in FIG. 6, there are particular frequency ranges in which none of the three source signals dominates. Note also that the lengths of the dominated frequency ranges (i.e., frequency ranges in which one of the source signals dominates) are not uniform, but rather are dictated by the characteristics of the power spectra themselves.

Returning to FIG. 5, after generator 402 identifies one or more frequency bands in which one of the source signals dominates, a set of auditory scene parameters is generated for each frequency band, where those parameters correspond to the node whose source signal dominates that frequency band (step 506). In some implementations, the processing of step 506 implemented by generator 402 generates the actual spatial parameters (e.g., ILD, ITD, and/or HRTF) for each dominated frequency band. In those cases, generator 402 receives (e.g., a priori) information about the relative spatial placement of each participant in the auditory scene to be synthesized (as indicated in FIG. 4). In addition to the combined signal, at least the following auditory scene parameters are transmitted to each conference node 304 of FIG. 3 for each dominated frequency band:

-   (1) Frequency of the start of the frequency band;
-   (2) Frequency of the end of the frequency band; and
-   (3) One or more spatial parameters (e.g., ILD, ITD, and/or HRTF) for the frequency band.

Although the identity of the particular node/participant whose source signal dominates the frequency band can be transmitted, such information is not required for the subsequent synthesis of the auditory scene. Note that, for those frequency bands for which no source signal is determined to dominate, no auditory scene parameters or other special information needs to be transmitted to the different conference nodes 304.

In other implementations, the generation of the spatial parameters for each dominated frequency band is implemented independently at each conference node 304. In those cases, generator 402 does not need any information about the relative spatial placements of the various participants in the synthesized auditory scene. Rather, in addition to the combined signal, only the following auditory scene parameters need to be transmitted to each conference node 304 for each dominated frequency band:

-   (1) Frequency of the start of the frequency band;
-   (2) Frequency of the end of the frequency band; and
-   (3) Identity of the node/participant whose source signal dominates the frequency band.

In such implementations, each conference node 304 is responsible for generating the appropriate spatial parameters for each dominated frequency range. Such an implementation enables each different conference node to generate a unique auditory scene (e.g., corresponding to different relative placements of the various conference participants within the synthesized auditory scene). The two per-band parameter variants are illustrated in the sketch following this list.
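
As a concrete sketch of the two per-band parameter variants just described, the following Python record carries the fields listed above; the record type and its field names are hypothetical, introduced only for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BandParameters:
    """Auditory scene parameters for one dominated frequency band.

    Server-side spatialization: ild_db and/or itd_samples are filled in
    and dominant_node may be omitted.
    Node-side spatialization: only dominant_node is sent, and each
    conference node derives its own spatial parameters from its local
    placement of the participants.
    """
    start_hz: float                       # frequency of the start of the band
    stop_hz: float                        # frequency of the end of the band
    ild_db: Optional[float] = None        # spatial parameters (server-side)
    itd_samples: Optional[float] = None
    dominant_node: Optional[int] = None   # dominating participant (node-side)
```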

In either type of implementation, the processing of FIG. 5 is preferably repeated at a specified interval (e.g., once for every 20-msec frame of audio data). As a result, the number and definition of the dominated frequency ranges, as well as the particular source signals that dominate those ranges, will typically vary over time (e.g., from frame to frame), reflecting the fact that the set of conference participants who are speaking at any given time will vary over time, as will the characteristics of their individual voices (e.g., intonations and/or volumes). Depending on the implementation, the spatial parameters corresponding to each conference participant may be either static (e.g., for synthesis of stationary participants whose relative positions do not change over time) or dynamic (e.g., for synthesis of mobile participants whose relative positions are allowed to change over time).

In alternative embodiments, rather than selecting a set of spatial parameters that corresponds to a single source, a set of spatial parameters can be generated that reflects the contributions of two or more (or even all) of the participants. For example, weighted averaging can be used to generate an ILD value that represents the relative contributions of the two or more most dominant participants. In such cases, each set of spatial parameters is a function of the relative dominance of the most dominant participants for a particular frequency band.

FIG. 7 shows a block diagram of the audio processing performed by each conference node 304 in FIG. 3 to convert a single combined mono audio signal and corresponding auditory scene parameters received from conference server 302 into the binaural signal for a synthesized auditory scene. In particular, time-frequency (TF) transform 702 converts each frame of the combined signal into the frequency domain.

For each dominated frequency band, auditory scene synthesizer 704 applies the corresponding auditory scene parameters to the converted combined signal to generate left and right audio signals for that frequency band in the frequency domain. In particular, for each audio frame and for each dominated frequency band, synthesizer 704 applies the set of spatial parameters corresponding to the participant whose source signal dominates the combined signal for that dominated frequency range. If the auditory scene parameters received from the conference server do not include the spatial parameters for each conference participant, then synthesizer 704 receives information about the relative spatial placement of the different participants in the synthesized auditory scene, as indicated in FIG. 7, so that the set of spatial parameters for each dominated frequency band in the combined signal can be generated locally at the conference node.

An inverse TF transform 706 is then applied to each of the left and right audio signals to generate the left and right audio signals of the binaural signal in the time domain corresponding to the synthesized auditory scene. The resulting auditory scene is perceived as being approximately the same as for an ideally synthesized binaural signal with the same corresponding spatial parameters but applied over the whole spectrum of each individual source signal.

FIG. 8 shows a graphical representation of the power spectrum in the frequency domain for the combined signal generated from the three mono source signals from sources A, B, and C in FIG. 6. In addition to showing the three different source signals (dotted lines), FIG. 8 also shows the same frequency bands identified in FIG. 6 in which the power of one of the three source signals dominates the other two. It is to these dominated frequency bands that auditory scene synthesizer 704 applies the appropriate sets of spatial parameters.

In a typical audio frame, not all of the conference participants will dominate at least one frequency band, since not all of the participants will typically be talking at the same time. If only one participant is talking, then only that participant will typically dominate any of the frequency bands. By the same token, during an audio frame corresponding to relative silence, it may be that none of the participants dominates any frequency band. For those frequency bands for which no dominating participant is identified, no spatial parameters are applied, and the left and right audio signals of the resulting binaural signal for those frequency bands are identical.

Time-Frequency Transform

As indicated above, TF transform 702 in FIG. 7 converts the combined mono audio signal to the spectral (i.e., frequency) domain frame-wise, in order for the system to operate for real-time applications. For each frequency band n at each time k (e.g., frame number k), a level difference ΔL_n[k], a time difference τ_n[k], and/or an HRTF is to be introduced into the underlying audio signal. In a preferred embodiment, TF transform 702 is a DFT-based transform, such as those described in A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Signal Processing Series, Prentice Hall, 1989, the teachings of which are incorporated herein by reference. The transform is chosen to provide the ability to synthesize frequency-dependent and time-adaptive time differences τ_n[k]. The same transform can be used advantageously for the synthesis of frequency-dependent and time-adaptive level differences ΔL_n[k] and for HRTFs.

When W time-domain samples s_0, . . . , s_{W−1} are converted to W samples S_0, . . . , S_{W−1} in the complex spectral domain with a DFT, a circular time-shift of d time-domain samples can be obtained by modifying the W spectral values according to Equation (1) as follows:

$$\hat{S}_n = S_n \, e^{-j\frac{2\pi n d}{W}}. \qquad (1)$$

In order to introduce a non-circular time-shift within each frame (as opposed to a circular time-shift), the time-domain samples s_0, . . . , s_{W−1} are padded with Z zeros at the beginning and at the end of the frame, and a DFT of size N = 2Z + W is then used. A non-circular time-shift within the range d ∈ [−Z, Z] can then be implemented by modifying the resulting N spectral coefficients according to Equation (2) as follows:

$$\hat{S}_n = S_n \, e^{-j\frac{2\pi n d}{N}}. \qquad (2)$$
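
A minimal sketch of this phase-based shift, assuming an integer shift d and standard NumPy conventions (the function name is illustrative):

```python
import numpy as np

def spectral_time_shift(frame, d, Z):
    """Shift a W-sample frame by d samples (|d| <= Z) without wrap-around.

    The frame is zero-padded by Z samples on each side (DFT size
    N = 2Z + W), and each coefficient S_n is multiplied by
    exp(-j*2*pi*n*d/N), as in Equation (2).
    """
    W = len(frame)
    N = 2 * Z + W
    padded = np.concatenate([np.zeros(Z), frame, np.zeros(Z)])
    S = np.fft.fft(padded)
    n = np.arange(N)
    S_hat = S * np.exp(-2j * np.pi * n * d / N)   # Equation (2)
    return np.fft.ifft(S_hat).real                # shifted frame, length N
```

Because of the zero padding, shifts of up to Z samples in either direction move the frame within the padded region instead of wrapping it circularly.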

The described scheme works as long as the time-shift d does not vary in time. Since the desired d usually varies over time, the transitions are smoothed by using overlapping windows for the analysis transform. A frame of N samples is multiplied with the analysis window before an N-point DFT is applied. The following Equation (3) shows the analysis window, which includes the zero padding at the beginning and at the end of the frame:

$$w_a[k] = \begin{cases} 0 & \text{for } k < Z \\ \sin^2\!\left(\dfrac{(k - Z)\,\pi}{W}\right) & \text{for } Z \le k < Z + W \\ 0 & \text{for } Z + W \le k \end{cases} \qquad (3)$$

where Z is the width of the zero region before and after the window. The non-zero window span is W, and the size of the transform is N = 2Z + W.
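
The window of Equation (3) can be constructed and checked in a few lines. The sketch below (illustrative function name) also verifies the overlap property discussed in connection with FIG. 9, namely that windows of adjacent frames overlapped by W/2 samples sum to one, since sin²(x) + sin²(x + π/2) = 1:

```python
import numpy as np

def analysis_window(W, Z):
    """Zero-padded sin^2 analysis window of Equation (3), length N = 2Z + W."""
    k = np.arange(2 * Z + W)
    w = np.zeros(2 * Z + W)
    inside = (k >= Z) & (k < Z + W)
    w[inside] = np.sin((k[inside] - Z) * np.pi / W) ** 2
    return w

# Adjacent frames are hopped by W/2 samples, so the second half of one
# window overlaps the first half of the next; the two halves sum to one.
W, Z = 8, 2
w = analysis_window(W, Z)
assert np.allclose(w[Z + W // 2 : Z + W] + w[Z : Z + W // 2], 1.0)
```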

FIG. 9 shows a representation of the analysis window, which was chosen such that the windows of adjacent frames sum to one when overlapped by W/2 samples. The time-span of the window shown in FIG. 9 is shorter than the DFT length, such that non-circular time-shifts within the range [−Z, Z] are possible. To gain more flexibility in changing time differences, level differences, and HRTFs in time and frequency, a higher factor of oversampling can be used by choosing the time-span of the window to be smaller and/or by overlapping the windows more.

The zero padding of the analysis window shown in FIG. 9 allows the implementation of convolutions with HRTFs as simple multiplications in the frequency domain. Therefore, the transform is also suitable for the synthesis of HRTFs in addition to time and level differences. A more general and slightly different point of view of a similar transform is given by J. B. Allen, "Short-term spectral analysis, synthesis and modification by discrete Fourier transform," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-25, pp. 235-238, June 1977, the teachings of which are incorporated herein by reference.

Obtaining A Binaural Signal From A Mono Signal

In certain implementations, auditory scene synthesizer 704 of FIG. 7 applies different sets of specified level and time differences to the different dominated frequency bands in the combined signal to generate the left and right audio signals of the binaural signal for the synthesized auditory scene. In particular, for each frame k, each dominated frequency band n is associated with a level difference ΔL_n[k] and a time difference τ_n[k]. In preferred embodiments, these level and time differences are applied symmetrically to the spectrum of the combined signal to generate the spectra of the left and right audio signals according to Equations (4) and (5), respectively, as follows:

$$S_n^L = \frac{10^{\Delta L_n / 20}}{\sqrt{1 + 10^{\Delta L_n / 10}}}\, S_n \, e^{-j\frac{2\pi n \tau_n}{2N}} \qquad (4)$$

$$S_n^R = \frac{1}{\sqrt{1 + 10^{\Delta L_n / 10}}}\, S_n \, e^{+j\frac{2\pi n \tau_n}{2N}} \qquad (5)$$

where {S_n} are the spectral coefficients of the combined signal and {S_n^L} and {S_n^R} are the spectral coefficients of the resulting binaural signal. The level differences {ΔL_n} are expressed in dB and the time differences {τ_n} in numbers of samples. This normalization preserves the power of each spectral coefficient while imposing a level ratio of ΔL_n dB and a relative delay of τ_n samples split evenly between the two channels.
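
The per-band synthesis of Equations (4) and (5) reduces to a few array operations per dominated band. The following sketch (hypothetical function name; band bookkeeping simplified) applies one (ΔL, τ) pair to the bins of one band:

```python
import numpy as np

def synthesize_band(S, band, dL_db, tau, N):
    """Apply Equations (4) and (5) to one dominated frequency band.

    S: complex DFT coefficients of the combined frame, length N.
    band: (n_start, n_stop) bin indices of the dominated band.
    dL_db: level difference in dB; tau: time difference in samples.
    Returns (S_L, S_R), the left/right coefficients for the band.
    """
    n = np.arange(band[0], band[1])
    a = 10 ** (dL_db / 20)                    # amplitude ratio from dB level
    norm = np.sqrt(1 + 10 ** (dL_db / 10))    # keeps the band's power constant
    phase = np.exp(-1j * 2 * np.pi * n * tau / (2 * N))  # half the ITD per ear
    S_L = (a / norm) * S[n] * phase
    S_R = (1 / norm) * S[n] * np.conj(phase)  # opposite half-shift
    return S_L, S_R
```

Bands with no dominating source are simply copied unchanged into both channels, yielding the identical left and right signals noted earlier for such bands.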

For the spectral synthesis of auditory scenes based on HRTFs, the left and right spectra of the binaural signal may be obtained using Equations (6) and (7), respectively, as follows:

$$S_n^L = \sum_{m=1}^{M} w_{m,n}\, H_{m,n}^L\, S_n \qquad (6)$$

$$S_n^R = \sum_{m=1}^{M} w_{m,n}\, H_{m,n}^R\, S_n \qquad (7)$$

where H_{m,n}^L and H_{m,n}^R are the complex frequency responses of the HRTFs corresponding to sound source m. For each spectral coefficient, a weighted sum of the frequency responses of the HRTFs of all sources is applied with weights w_{m,n}. The level differences ΔL_n, time differences τ_n, and HRTF weights w_{m,n} are preferably smoothed in frequency and time to prevent artifacts.
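
Since the HRTFs enter Equations (6) and (7) as per-bin complex gains, the synthesis is a weighted sum over sources at each spectral coefficient. A minimal sketch (illustrative names; assumes the HRTF frequency responses are already sampled on the DFT grid):

```python
import numpy as np

def synthesize_hrtf_frame(S, H_L, H_R, w):
    """Apply Equations (6) and (7) to one frame.

    S: complex spectrum of the combined signal, shape (N,).
    H_L, H_R: complex HRTF frequency responses per source, shape (M, N).
    w: per-source, per-bin weights, shape (M, N); in a bin dominated by
       source m, w[m, n] would be near 1 and the other weights near 0.
    Returns (S_L, S_R), the left/right spectra of the binaural signal.
    """
    S_L = np.sum(w * H_L, axis=0) * S   # weighted sum over sources, per bin
    S_R = np.sum(w * H_R, axis=0) * S
    return S_L, S_R
```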

Experimental Results

To evaluate how useful the present invention is for a desktop conferencing application, twelve participants were given a task that required responding to one of two simultaneous voice messages. This is a variation of the "cocktail party problem" of attending to one voice in the presence of others. The signals were presented to the participants over headphones in an acoustically isolated room. Five different kinds of signals were tested for their effect on the ability to respond to one of two simultaneous messages:

-   Test 1: diotic: a mono signal presented to both ears
-   Test 2: ILD_i: an ideally synthesized binaural signal with ILDs
-   Test 3: ITD_i: an ideally synthesized binaural signal with ITDs
-   Test 4: ILD_p: a binaural signal perceptually synthesized with ILDs using the present invention
-   Test 5: ITD_p: a binaural signal perceptually synthesized with ITDs using the present invention

Each of the participants took all of the tests in randomized order.

The tests used the speech corpus introduced in R. S. Bolia, W. T. Nelson, M. A. Ericson, and B. D. Simpson, "A speech corpus for multitalker communications research," J. Acoust. Soc. Am., vol. 107, no. 2, pp. 1065-1066, February 2000, the teachings of which are incorporated herein by reference. Similar tests have also been conducted by others, such as reported in R. S. Bolia, M. A. Ericson, W. T. McKinley, and B. D. Simpson, "A cocktail party effect in the median plane?," J. Acoust. Soc. Am., vol. 105, pp. 1390-1391, 1999, and W. Spieth, J. F. Curtis, and J. C. Webster, "Responding to one of two simultaneous messages," J. Acoust. Soc. Am., vol. 26, no. 3, pp. 391-396, 1954, the teachings of both of which are incorporated herein by reference.

A typical sentence of the corpus is "READY LAKER, GO TO BLUE FIVE NOW," where LAKER is the call sign and BLUE FIVE is a color-number combination. Combinations of the eight different call signs, four different colors, and eight different numbers were chosen randomly, with the restriction that the call sign assigned to the participant occurred in 50% of the cases.

In the tests, each participant was instructed to respond, whenever his or her call sign was called, by indicating the color-number combination spoken by the talker who called the call sign. One of four female talkers was randomly chosen for each of the two talkers in each test item. One talker was spatially placed at the right side and the other at the left side for Tests 2 and 4 (ILD = ±16 dB) and for Tests 3 and 5 (ITD = ±500 μsec). Each of the five tests consisted of 20 test items, which were preceded by 10 training items.

Table I shows the results for the cases when the listeners were called by their call signs. The upper row shows the percentage of correct identification of the call sign, and the lower row shows the conditional percentage of the correct color-number combination given that the listener's call sign was correctly identified. These results suggest that the percentages of correct identification of the call sign and of the color and number improve significantly for ideally synthesized binaural signals (Tests 2 and 3) and perceptually synthesized binaural signals (Tests 4 and 5) over the diotic signal (Test 1), with the perceptually synthesized signals of Tests 4 and 5 being almost as good as the ideally synthesized signals of Tests 2 and 3. For the cases when the listeners were not called, the percentage of listeners responding was below two percent for all five tests.

TABLE I

              Test 1   Test 2   Test 3   Test 4   Test 5
call sign       70%      78%      85%      77%      78%
color-number    64%      98%      88%      96%      91%

Alternative Embodiments

In the previous sections, the present invention was described in the context of a desktop conferencing application. The present invention can also be employed for other applications. For example, the present invention can be applied where the input is a binaural signal corresponding to an (actual or synthesized) auditory scene, rather than the input being individual mono source signals as in the previous application. In this latter application, the binaural signal is converted into a single mono signal and auditory scene parameters (e.g., sets of spatial parameters). As in the desktop conferencing application, this application of the present invention can be used to reduce the transmission bandwidth requirements for the auditory scene since, instead of having to transmit the individual left and right audio signals for the binaural signal, only a single mono signal plus the relatively small amount of spatial parameter information need to be transmitted to a receiver, where the receiver performs processing similar to that shown in FIG. 7.

FIG. 10 shows a block diagram of transmitter 1000 for such an application, according to one embodiment of the present invention. As shown in FIG. 10, a TF transform 1002 is applied to corresponding frames of each of the left and right audio signals of the input binaural signal to convert the signals to the frequency domain. Auditory scene analyzer 1004 processes the converted left and right audio signals in the frequency domain to generate a set of auditory scene parameters for each of a plurality of different frequency bands in those converted signals. In particular, for each corresponding pair of audio frames, analyzer 1004 divides the converted left and right audio signals into a plurality of frequency bands. Depending on the implementation, each of the left and right audio signals can be divided into the same number of equally sized frequency bands. Alternatively, the size of the frequency bands may vary with frequency, e.g., with larger frequency bands at higher frequencies.
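
The band partitioning itself is a small utility. A sketch of both options just mentioned (equal-width bands, or bands that widen with frequency; the function name and the log-spaced choice are illustrative assumptions):

```python
import numpy as np

def band_edges(num_bins, num_bands, uniform=True):
    """Partition DFT bins 0..num_bins-1 into frequency bands.

    uniform=True gives equally sized bands; uniform=False gives bands
    that grow larger at higher frequencies (log-spaced edges), one way
    to realize the frequency-dependent band sizes described above.
    Returns a list of (start_bin, stop_bin) pairs.
    """
    if uniform:
        edges = np.linspace(0, num_bins, num_bands + 1)
    else:
        edges = np.geomspace(1, num_bins, num_bands + 1)
        edges[0] = 0                      # include the DC bin in band 0
    edges = np.unique(edges.astype(int))  # drop duplicate integer edges
    return list(zip(edges[:-1], edges[1:]))
```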

For each corresponding pair of frequency bands, analyzer 1004 compares the converted left and right audio signals to generate one or more spatial parameters (e.g., an ILD value, an ITD value, and/or an HRTF). In particular, for each frequency band, the cross-correlation between the converted left and right audio signals is estimated. The maximum value of the cross-correlation, which indicates how strongly the two signals are correlated, can be used as a measure of the dominance of one source in the band. If there is 100% correlation between the left and right audio signals, then only one source's energy is dominant in that frequency band. The smaller the cross-correlation maximum, the less a single source dominates. The time lag at which the cross-correlation reaches its maximum corresponds to the ITD. The ILD can be obtained by computing the level difference of the power spectral values of the left and right audio signals. In this way, each set of spatial parameters is generated by treating the corresponding frequency range as if it were dominated by a single source signal. For those frequency bands where this assumption is true, the generated set of spatial parameters will be fairly accurate. For those frequency bands where this assumption is not true, the generated set of spatial parameters will have less physical significance to the actual auditory scene. On the other hand, the assumption is that those frequency bands contribute less significantly to the overall perception of the auditory scene. As such, the application of such "less significant" spatial parameters will have little if any adverse effect on the resulting auditory scene. In any case, transmitter 1000 transmits these auditory scene parameters to the receiver for use in reconstructing the auditory scene from the mono audio signal.
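
A sketch of the per-band cue estimation (hypothetical function name; here the cross-correlation is computed via the inverse DFT of the band-limited cross-spectrum, and its normalized maximum serves as the dominance measure):

```python
import numpy as np

def estimate_band_cues(L, R, band, N, max_lag):
    """Estimate (ITD, ILD, dominance) for one frequency band.

    L, R: complex DFT coefficients of the left/right frames, length N.
    band: (n_start, n_stop) bin indices; max_lag: ITD search range in samples.
    Returns (itd_samples, ild_db, coherence), with coherence in [0, 1].
    """
    n = np.arange(band[0], band[1])
    cross = np.zeros(N, dtype=complex)
    cross[n] = L[n] * np.conj(R[n])            # band-limited cross-spectrum
    xcorr = np.fft.ifft(cross).real            # cross-correlation over lag
    lags = np.concatenate([np.arange(max_lag + 1), np.arange(-max_lag, 0)])
    vals = np.concatenate([xcorr[:max_lag + 1], xcorr[-max_lag:]])
    k = int(np.argmax(np.abs(vals)))
    itd = int(lags[k])                         # lag of the correlation maximum
    pL = np.sum(np.abs(L[n]) ** 2)
    pR = np.sum(np.abs(R[n]) ** 2)
    ild_db = 10 * np.log10((pL + 1e-12) / (pR + 1e-12))
    coherence = N * np.abs(vals[k]) / (np.sqrt(pL * pR) + 1e-12)
    return itd, ild_db, coherence
```

A coherence near 1 indicates that a single source dominates the band; smaller values indicate that the band mixes several sources, in which case the estimated cues carry less physical significance, as discussed above.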

Auditory scene remover 1006 combines the converted left and right audio signals in the frequency domain to generate the mono audio signal. In a basic implementation, remover 1006 simply averages the left and right audio signals. In preferred implementations, however, more sophisticated processing is performed to generate the mono signal. In particular, for example, the spatial parameters generated by auditory scene analyzer 1004 can be used to modify both the left and right audio signals in the frequency domain as part of the process of generating the mono signal, where each different set of spatial parameters is used to modify a corresponding frequency band in each of the left and right audio signals. For example, if the generated spatial parameters include an ITD value for each frequency band, then the left and right audio signals in each frequency band can be appropriately time-shifted using the corresponding ITD value to make the ITD between the left and right audio signals become zero. The power spectra of the time-shifted left and right audio signals can then be added such that the perceived loudness of each frequency band is the same in the resulting mono signal as in the original binaural signal.
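
A sketch of this preferred downmix for one band, using the per-band ITD from the analyzer (function name and sign convention are illustrative assumptions): each channel is shifted by half the ITD in opposite directions so the band adds coherently, and the sum is rescaled to preserve the band's combined power:

```python
import numpy as np

def downmix_band(L, R, band, itd, N):
    """Combine left/right spectra into mono coefficients for one band.

    L, R: complex DFT coefficients of the left/right frames, length N.
    band: (n_start, n_stop) bin indices; itd: estimated ITD in samples.
    """
    n = np.arange(band[0], band[1])
    phase = np.exp(1j * np.pi * n * itd / N)      # half the ITD per channel
    L_s = L[n] * phase                            # advance one channel ...
    R_s = R[n] * np.conj(phase)                   # ... delay the other
    M = L_s + R_s                                 # ITD-compensated sum
    target = np.sum(np.abs(L[n]) ** 2 + np.abs(R[n]) ** 2)
    actual = np.sum(np.abs(M) ** 2) + 1e-12
    return M * np.sqrt(target / actual)           # loudness-preserving scale
```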

An inverse TF transform 1008 is then applied to the resulting mono audio signal in the frequency domain to generate the mono audio signal in the time domain. The mono audio signal can then be compressed and/or otherwise processed for transmission to the receiver. Since a receiver having a configuration similar to that in FIG. 7 converts the mono audio signal back into the frequency domain, the possibility exists of omitting inverse TF transform 1008 of FIG. 10 and TF transform 702 of FIG. 7, with the transmitter transmitting the mono audio signal to the receiver in the frequency domain.

As in the previous application, the receiver applies the received auditory scene parameters to the received mono audio signal to synthesize (or, in this latter case, reconstruct an approximation of) the auditory scene. Note that, in this latter application, there is no need for any a priori knowledge of either the number of sources involved in the original auditory scene or their relative positions. In this latter application, there is no identification of particular sources with particular frequency bands. Rather, the frequency bands are selected in an open-loop manner, but processed under the same underlying assumption as in the previous application: that is, that each frequency band can be treated as if it corresponded to a single source using a corresponding set of spatial parameters.

Although this latter application has been described in the context of processing in which the input is a binaural signal, this application of the present invention can be extended to (two-channel or multi-channel) stereo signals. Similarly, although the invention has been described in the context of systems that generate binaural signals corresponding to auditory scenes perceived using headphones, the present invention can be extended to apply to the generation of (two-channel or multi-channel) stereo signals for loudspeaker playback.

The present invention may be implemented as circuit-based processes, including possible implementation on a single integrated circuit. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing steps in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer.

The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.

It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.

What is claimed is:

1. A method for synthesizing an auditory scene, comprising the steps of: (a) dividing an input audio signal into a plurality of different frequency bands; and (b) applying two or more different sets of one or more spatial parameters to two or more of the different frequency bands in the input audio signal to generate two or more synthesized audio signals of the auditory scene, wherein for each of the two or more different frequency bands, the corresponding set of one or more spatial parameters is applied to the input audio signal as if the input audio signal corresponded to a single audio source in the auditory scene.